This dataset contains the following books:
* Upanishads - ancient Sanskrit texts of spiritual teachings and ideas of Hinduism. They are part of the oldest scriptures of Hinduism, the Vedas, that deal with meditation, philosophy, and spiritual knowledge; other parts of the Vedas deal with mantras, benedictions, rituals, ceremonies, and sacrifices.
* Yoga Sutras - a collection of 196 Sanskrit sutras\* (aphorisms) on the theory and practice of yoga.
* Buddha Sutras - initially passed on orally by monks, but later written down and composed as manuscripts in various Indo-Aryan languages, which were then translated into other local languages as Buddhism spread.
* Tao Te Ching - a fundamental text for both philosophical and religious Taoism. It also strongly influenced other schools of Chinese philosophy and religion, including Legalism, Confucianism, and Buddhism, which was largely interpreted through the use of Taoist words and concepts when it was originally introduced to China.
* Old Testament:
  * Book of Wisdom - Solomon’s speech concerning wisdom, wealth, power and prayer.
  * Book of Proverbs - not merely an anthology but a “collection of collections” relating to a pattern of life which lasted for more than a millennium. It is an example of the Biblical wisdom tradition, and raises questions of values, moral behaviour, the meaning of human life, and right conduct.
  * Book of Ecclesiastes - one of the 24 books of the Tanakh (Hebrew Bible), where it is classified as one of the Ketuvim (Writings).
  * Book of Ecclesiasticus - commonly called the Wisdom of Sirach, or simply Sirach.

More on Wikipedia.

\* In Indian literary traditions, “sutra” refers to an aphorism or a collection of aphorisms in the form of a manual or, more broadly, a condensed manual or text. Sutras are a genre of ancient and medieval Indian texts found in Hinduism, Buddhism and Jainism.
Quick look at the data size:
cat('Number of features:', ncol(data))
## Number of features: 8267
cat('Number of records:', nrow(data))
## Number of records: 590
cat('Number of words in total:', sum(data[-1]))
## Number of words in total: 60609
To give some perspective, the Polish classic ‘Pan Tadeusz’ has 68,682 words in total.
knitr::kable(
data[1:10, 1:10], caption = 'Dataset',
booktabs = TRUE
) %>%
kable_styling()
| X | foolishness | hath | wholesome | takest | feelings | anger | vaivaswata | matrix | kindled |
|---|---|---|---|---|---|---|---|---|---|
| Buddhism_Ch1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Buddhism_Ch2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Buddhism_Ch3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Buddhism_Ch4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Buddhism_Ch5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Buddhism_Ch6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Buddhism_Ch7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Buddhism_Ch8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Buddhism_Ch9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Buddhism_Ch10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
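The table above is a document-term matrix: one row per chapter, one column per word, cells holding word counts. As a reminder of how such a matrix comes about, here is a minimal base-R sketch on two made-up toy documents (not the actual source texts):

```r
# Toy documents standing in for chapters of the real books.
docs <- c(Doc1 = "the way that can be told is not the way",
          Doc2 = "anger never ends anger")

# Tokenize, build a shared vocabulary, and count occurrences per document.
tokens <- lapply(docs, function(d) strsplit(tolower(d), "\\s+")[[1]])
vocab  <- sort(unique(unlist(tokens)))
dtm    <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
```

`dtm` then has one row per document and one column per vocabulary word, the same shape as the dataset used here.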
Looking at the data, we see no reason to keep the chapters separate. We then used the stringi package to extract the book names. We combined the biblical texts into one, as they have significantly fewer chapters than the rest. Then we collapsed the data to one book per row (word occurrences were summed). We ended up with this dataframe:
library(stringi)

# Extract the book name (the leading letters) from each chapter label.
book_name <- stri_extract(data$X, regex = "^[a-zA-Z]+")
# All biblical books start with "Bo"; combine them under one "Bible" label.
book_name <- ifelse(startsWith(book_name, "Bo"), "Bible", book_name)
data$book_name <- book_name
data <- data[, -1]
book_names <- unique(data$book_name)
# Sum word occurrences per book.
df <- matrix(0, nrow = length(book_names), ncol = ncol(data) - 1)
for (i in seq_along(book_names)) {
  df[i, ] <- colSums(data[data$book_name == book_names[i], 1:(ncol(data) - 1)])
}
df <- as.data.frame(df)
df <- cbind(book_names, df)
colnames(df) <- c("book_name", colnames(data)[1:(ncol(data) - 1)])
m <- ncol(df)
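The per-book summation loop can also be written with base R's `rowsum()`, which sums rows of a matrix or data frame by a grouping factor. A sketch on toy data with the same layout (word columns plus a `book_name` column):

```r
# Toy counts mirroring the dataset's layout.
toy <- data.frame(foolishness = c(1, 0, 2),
                  anger       = c(0, 3, 1),
                  book_name   = c("Bible", "Bible", "Upanishad"))

# One call replaces the explicit loop; row names become the book names.
agg <- rowsum(toy[, c("foolishness", "anger")], toy$book_name)
```

This avoids preallocating the matrix and indexing by hand.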
m <- ncol(df)
knitr::kable(
df[1:5, 1:10], caption = 'Dataset',
booktabs = TRUE
) %>%
kable_styling()
| book_name | foolishness | hath | wholesome | takest | feelings | anger | vaivaswata | matrix | kindled |
|---|---|---|---|---|---|---|---|---|---|
| Buddhism | 0 | 0 | 0 | 0 | 19 | 0 | 0 | 0 | 0 |
| TaoTeChing | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Upanishad | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 0 | 1 |
| YogaSutra | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| Bible | 2 | 332 | 3 | 1 | 0 | 31 | 0 | 0 | 3 |
This is already much better suited for visualization.
Most common words per book
for (bn in book_names) {
  tmp <- sort(df[df$book_name == bn, 2:m], decreasing = TRUE)
  barplot(height = unlist(tmp[10:1]),
          las = 2,
          horiz = TRUE,
          main = paste("Most frequent words in", bn),
          cex.names = 0.7,
          col = "lightblue")
}
A more interesting way to visualize words is a word cloud.
TaoTeChing
bn <- "TaoTeChing"
tmp <- unlist(df[df$book_name == bn, -1])
names(tmp) <- NULL
df2 <- data.frame(word = colnames(df[,-1]), freq = tmp)
set.seed(1234)
wordcloud(words = df2$word, freq = df2$freq, min.freq = 1,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"), main = bn)
Bible
bn <- "Bible"
tmp <- unlist(df[df$book_name == bn, -1])
names(tmp) <- NULL
df2 <- data.frame(word = colnames(df[,-1]), freq = tmp)
set.seed(1234)
wordcloud(words = df2$word, freq = df2$freq, min.freq = 1,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"), main = bn)
Buddhism
bn <- "Buddhism"
tmp <- unlist(df[df$book_name == bn, -1])
names(tmp) <- NULL
df2 <- data.frame(word = colnames(df[,-1]), freq = tmp)
set.seed(1234)
wordcloud(words = df2$word, freq = df2$freq, min.freq = 1,
max.words=50, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"), main = bn)
Upanishad
bn <- "Upanishad"
tmp <- unlist(df[df$book_name == bn, -1])
names(tmp) <- NULL
df2 <- data.frame(word = colnames(df[,-1]), freq = tmp)
set.seed(1234)
wordcloud(words = df2$word, freq = df2$freq, min.freq = 1,
max.words=80, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"), main = bn)
YogaSutra
bn <- "YogaSutra"
tmp <- unlist(df[df$book_name == bn, -1])
names(tmp) <- NULL
df2 <- data.frame(word = colnames(df[,-1]), freq = tmp)
set.seed(1234)
wordcloud(words = df2$word, freq = df2$freq, min.freq = 1,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"), main = bn)
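The five word-cloud chunks above differ only in the book name and `max.words`, so they can be folded into two small helpers. This is a sketch assuming the aggregated `df` layout used above (`book_name` in column 1, word counts in the rest); note that `main` is not a documented `wordcloud()` parameter, so the title is safer added with `title()`:

```r
# Frequencies for one book, as the (word, freq) data frame wordcloud() expects.
book_freqs <- function(df, bn) {
  counts <- unlist(df[df$book_name == bn, -1], use.names = FALSE)
  data.frame(word = colnames(df)[-1], freq = counts)
}

# One plotting helper instead of five near-identical chunks.
plot_book_cloud <- function(df, bn, max_words = 100) {
  df2 <- book_freqs(df, bn)
  set.seed(1234)
  wordcloud::wordcloud(words = df2$word, freq = df2$freq, min.freq = 1,
                       max.words = max_words, random.order = FALSE,
                       rot.per = 0.35,
                       colors = RColorBrewer::brewer.pal(8, "Dark2"))
  title(bn)
}
```

Each plot then becomes a one-liner, e.g. `plot_book_cloud(df, "Buddhism", 50)`.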
How do our chapters, grouped into books, look when embedded with t-SNE?
library(Rtsne)
library(ggplot2)
library(plotly)
tsne <- Rtsne(data[,1:8266], dims = 2, perplexity = 30, verbose=FALSE, max_iter = 500)
data_to_plot <- as.data.frame(tsne$Y)
data_to_plot$label <- book_name
ggplot(data_to_plot, aes(x = V1, y = V2, color = label)) +
geom_point() +
theme_bw() +
scale_color_manual(values = brewer.pal(8, "Set1"))
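t-SNE is stochastic and non-linear, so a cheap linear baseline is useful for comparison: a PCA projection with base R's `prcomp()`. A sketch on random toy counts (the real call would use the same `data[,1:8266]` matrix):

```r
# Toy stand-in: 50 "chapters" x 20 "words" of Poisson counts.
set.seed(1)
toy_counts <- matrix(rpois(50 * 20, lambda = 1), nrow = 50)

# Project onto the first two principal components.
pca  <- prcomp(toy_counts, center = TRUE)
proj <- as.data.frame(pca$x[, 1:2])
```

`proj` can be plotted exactly like `tsne$Y`, with the same book labels.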
And how they look in 3D:
tsne <- Rtsne(data[,1:8266], dims = 3, perplexity = 30, verbose=FALSE, max_iter = 500)
data_to_plot <- as.data.frame(tsne$Y)
data_to_plot$label <- book_name
plot_ly(data_to_plot, x = ~V1, y = ~V2, z = ~V3, color = ~label, size = 0.1)
## No trace type specified:
## Based on info supplied, a 'scatter3d' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter3d
## No scatter3d mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
Word lengths
d <- data.frame(word_len = NULL, book = NULL)
for (bn in book_names){
tmp_df <- df[df$book_name == bn,]
word_list <- sapply(colnames(tmp_df)[2:ncol(tmp_df)], function(x) rep(nchar(x), tmp_df[x]) )
word_list <- unlist(word_list)
names(word_list) <- NULL
p <- data.frame(word_len = word_list, book = rep(bn, length(word_list)))
d <- rbind(d, p)
}
ggplot(d, aes(x = word_len, fill = book)) + geom_density(adjust = 2, alpha = 0.5 ) + theme_minimal()
ggplot(d, aes(y = word_len, x= book, fill = book)) + geom_boxplot() + theme_minimal()
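The nested `sapply` above can be replaced by a single `rep()` over the precomputed word lengths, which expands each length by its count directly. A sketch assuming the aggregated `df` layout (`book_name` in column 1):

```r
# Expand word lengths by their per-book counts in one pass.
word_length_table <- function(df) {
  lens <- nchar(colnames(df)[-1])
  out <- lapply(seq_len(nrow(df)), function(i) {
    wl <- rep(lens, as.numeric(df[i, -1]))
    if (length(wl) == 0) return(NULL)
    data.frame(word_len = wl, book = df$book_name[i])
  })
  do.call(rbind, out)
}
```

The result feeds straight into the same `ggplot` calls.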
w <- colnames(data)
# Keep only words longer than 15 characters.
long <- parallelLapply(w, function(x) { if (nchar(x) > 15) x else NULL })
long <- unlist(long)
d <- data[, long] %>% colSums() %>% as.data.frame()
d$names <- rownames(d)
colnames(d)[1] <- "occurrences"
d %>% arrange(desc(occurrences)) %>% head(10) %>% kable() %>%
  kable_styling()
| occurrences | names |
|---|---|
| 16 | clingingaggregate |
| 13 | clingingaggregates |
| 8 | clingingsustenance |
| 6 | neitherpainfulnorpleasant |
| 5 | selfconsciousness |
| 3 | neitherpleasantnorpainful |
| 2 | noseconsciousness |
| 2 | incomprehensible |
| 2 | allconsciousness |
| 2 | eyeconsciousness |
*"These are the five clinging-aggregates: form as a clinging-aggregate, feeling as a clinging-aggregate, perception as a clinging-aggregate, fabrications as a clinging-aggregate, consciousness as a clinging-aggregate... These five clinging-aggregates are rooted in desire..."* – The Buddha

*"Is it the case that clinging and the five clinging-aggregates are the same thing, or are they separate?"* – A certain monk

*"Clinging is neither the same thing as the five clinging-aggregates, nor are they separate. Whatever desire & passion there is with regard to the five clinging-aggregates, that is the clinging there..."* – The Buddha

*"There are three kinds of feeling: pleasant feeling, painful feeling, & neither-pleasant-nor-painful feeling... Whatever is experienced physically or mentally as pleasant & gratifying is pleasant feeling. Whatever is experienced physically or mentally as painful & hurting is painful feeling. Whatever is experienced physically or mentally as neither gratifying nor hurting is neither-pleasant-nor-painful feeling... Pleasant feeling is pleasant in remaining and painful in changing. Painful feeling is painful in remaining and pleasant in changing. Neither-pleasant-nor-painful feeling is pleasant when conjoined with knowledge and painful when devoid of knowledge."*

– Buddhism
The NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). The annotations were done manually through crowdsourcing.
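A minimal sketch of how word frequencies can be scored against such a lexicon; the two-row lexicon here is a made-up stand-in, not the real NRC data:

```r
# Hypothetical miniature lexicon (stand-in for the NRC annotations).
lexicon <- data.frame(word = c("anger", "wholesome"),
                      sentiment = c("negative", "positive"))
freqs <- data.frame(word = c("anger", "tree", "wholesome"),
                    freq = c(3, 5, 2))

# Inner join keeps only words the lexicon knows, then sum per sentiment.
scored <- merge(freqs, lexicon, by = "word")
totals <- aggregate(freq ~ sentiment, data = scored, FUN = sum)
```

With the real lexicon, `totals` would hold one row per emotion/sentiment and book.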
get_words_out_of_bag <- function(df, n){
  #' df - bag of words
  #' n - number of the row to get words out of
  row <- df[n,] %>% select(which(df[n,] != 0))
  cols <- colnames(row)
  if (length(cols) == 1) { return(list(NULL)) }
  words <- sapply(2:length(cols), function(i, row, cols) { rep(cols[i], row[1, i]) },
                  cols = cols, row = row, simplify = TRUE)
  unlist(words)
}
get_words_in_books <- function(df, bookname){
  slice <- df[df[, 1] == bookname, ]
  # Pass the sliced data, not the full df, so row indices refer to this book.
  all <- sapply(1:nrow(slice), get_words_out_of_bag, df = slice, simplify = TRUE)
  unlist(all)
}
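For reference, the per-row expansion can be written without dplyr; a base-R sketch (assumes column 1 is the book label, as in the dataset):

```r
# Expand row n of a bag-of-words data frame back into a word vector.
get_words_base <- function(df, n) {
  counts <- as.numeric(df[n, -1])
  rep(colnames(df)[-1], counts)
}
```

Example: on a toy frame with counts `tao = 2, way = 1` the first row expands to `"tao" "tao" "way"`.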
parallelStartMulticore(detectCores())
## Starting parallelization in mode=multicore with cpus=4.
df <- read.csv('./AllBooks_baseline_DTM_Labelled.csv')
books <- stri_extract(df$X, regex = "^[a-zA-Z]+")
df$X <- books
starttime <- Sys.time()
inside <- parallelLapply(unique(books), get_words_in_books, df=df)
## Mapping in parallel: mode = multicore; level = NA; cpus = 4; elements = 8.
endtime <- Sys.time()
endtime - starttime
## Time difference of 3.746424 mins
rpivotTable(data = melted, cols = "sent", rows = "variable", rendererName = "Heatmap", aggregatorName = "Sum", vals = "value")
Let’s see with the books of the Bible treated as one:
df_sented_simp <- df_sented
df_sented_simp$Bible = (df_sented$BookOfEccleasiasticus +
df_sented$BookOfProverb +
df_sented$BookOfEcclesiastes +
df_sented$BookOfWisdom) / 4
df_sented_simp$BookOfProverb <- NULL
df_sented_simp$BookOfEcclesiastes <- NULL
df_sented_simp$BookOfEccleasiasticus <- NULL
df_sented_simp$BookOfWisdom <- NULL
melted <- reshape::melt(df_sented_simp)
## Using sent as id variables
rpivotTable(data = melted, cols = "sent", rows = "variable", rendererName = "Heatmap", aggregatorName = "Sum", vals = "value")